InfoMagic Internet Tools 1993 July

Text File | 1987-05-19 | 13.5 KB | 286 lines

.TI Keeping watch over the flocks
.TI by night (and day)
.AU 7
Kenneth Ingham
University of New Mexico Computing Center
Distributed Systems Group
2701 Campus NE
Albuquerque, NM 87131
(505) 277-8044
ingham@charon.unm.edu or ucbvax!unmvax!charon!ingham
.AB
Over the last several years, the number of machines maintained by the
University of New Mexico Computing Center has increased rapidly, yet the
number of system managers monitoring these systems has remained static.
Consequently, the system managers were faced with the task of watching
more and more machines; since only one system manager is on call at any
time (known affectionately as "DOC"), this soon proved to be an
unacceptable situation.
Shell scripts running every six hours gave some assistance; this was
offset by the fact that the scripts generated a great deal of output
indicating normal system operation, which the system manager still had
to scan carefully for signs of trouble.
This paper describes \fIwatcher\fR, a flexible system monitor which
watches the system more closely than the human system manager while
generating less output for him to examine.
.sp
Running more often than the above-mentioned set of shell scripts,
\fIwatcher\fR is able to keep closer tabs on the system; since it
delivers only a list of potential problems, however, this extra
monitoring produces \fIno\fR corresponding increase in the demand on DOC.
No problems slip by unnoticed in the more concise output, leading to an
improvement in overall system availability as well as more effective
use of the system manager's time.
.BD
.SE 0. Acknowledgments (I couldn't have done it without you)
I would like to thank Leslie Gorsline for her assistance in the writing
of this paper.
Without her, this paper might not have been.
Thanks also to the UNMCC distributed systems group for their comments,
which helped improve \fIwatcher\fR.
.SE 1. Background (the problem)
The computing facilities offered by the University of New Mexico
Computing Center (UNMCC) include three microvaxen, five large vaxen
(780 or bigger), and a Sequent B8000.
In addition to these Unix/VMS machines, the UNMCC Distributed Systems
Group (DSG) monitors a number of the microvaxen and Sun workstations
scattered across campus.
This duty falls to the DSG Programmer designated as "DOC", or "DSG On
Call", who receives his beeper based on a monthly rotation schedule.
.sp
In the past, shell scripts running every six hours reported various
system statistics to DOC, who then scanned the output for signs of
possible trouble.
The output of these shell scripts became overwhelming as the number of
machines and potential problems grew, and the time DOC had to spend
reading it grew with it.
In addition, most of this output merely indicated normal system
operation; potential problems were buried amongst non-problems.
Because of this, DOC could often waste a tremendous amount of time
wading through system status reports, time which could be better spent
actually fixing system problems.
.sp
Unix is equipped with many powerful tools for program development, but
none that simply watch the system for signs of trouble.
Programs like \fIps\fR and \fIdf\fR provide information regarding the
current state of the machine, yet it remains DOC's responsibility to
interpret this information and assess the health of the system at any
given time.
This deficiency can be rectified by giving the system the capacity to
determine its own state of health, advising DOC when it notices a
problem which requires DOC's intervention.
.SE 2. Design Goals (devising the solution)
In designing \fIwatcher\fR, the author closely examined just what DOC
does in monitoring the system; just how \fIdoes\fR DOC spot potential
trouble in the DOC reports?
These reports consist of output from \fIdf -i\fR, \fIruptime\fR,
\fIps -aux | sort\fR, and the tail of \fIcronlog\fR, which usually only
changes in the middle of the night.
It was determined that DOC's task consisted primarily of scanning
various numbers in this output and deciding whether they had passed an
allowable maximum or minimum, or had changed too much since the last
time the command was run (assuming the previous value was even
remembered).
Getting a computer to do this is more complicated than it might seem at
first glance, because the location of the pertinent information is not
consistent between runs of these commands.
For instance, the process occupying the fifth line of \fIps -ax\fR might
next time appear on the eighth line; similarly, \fIuptime\fR does not
consistently put germane information in the same place on the line.
.sp
While flexibility is certainly a primary design consideration, it is
not the whole story.
In order to improve DOC's effectiveness, the program should run
frequently, roughly every two or three hours, catching problems early
(hopefully before they have affected the users).
The program should therefore also be as silent as possible except when
it detects a potential problem; any advantage DOC gains in using
\fIwatcher\fR would be eliminated if the program delivered an
exceedingly verbose status report every two hours.
\fIwatcher\fR's problem reports should be exact and concise, leading DOC
immediately to the trouble.
.sp
The problem of reducing the amount of output DOC must process can be
approached in different ways, including the redesign of the current
shell scripts.
A simple \fIawk\fR script can watch the output from \fIdf\fR [1].
However, each command would require its own custom-tailored \fIawk\fR
script.
This task grows more complicated as the number of commands to be
watched increases.
While a program could be written to generate these \fIawk\fR scripts,
this process is needlessly complex; for only a bit more work, an
efficient C program such as \fIwatcher\fR can be developed.
.SE 3. Design (actual implementation of the solution)
Run at intervals specified in \fIcrontab\fR, \fIwatcher\fR parses a
control file (./\fIwatcherfile\fR by default) with a
\fIyacc\fR-generated parser, building a data structure containing all
of the information from the file.
The file contains the list of commands \fIwatcher\fR should run (the
pipeline), output specifications for each command (the output format),
and the guidelines used in determining if something is amiss and should
be reported to DOC (the change format).
A sample \fIwatcher\fR control file would look something like this
(comment lines begin with a '#'):
.EX
# Here is the pipeline and its alias:
(df -i | /usr/ucb/tail +2) { df }
# the output format; this is a column output format:
$1-9 device%k $41-42 spaceused%d $64-65 inodesused%d:
# and the change format:
spaceused 15%; spaceused 0 89; inodesused 15%; inodesused 0 49.

# another command example:
(/usr/ucb/ruptime | fgrep -f UnmHosts) { ruptime }
# this is a relative output format
2 status%s 1 machine%k 7 loadav%d:
# and another change format:
loadav 0 10; status "up".
.NX
The first entry causes \fIwatcher\fR to run the \fIdf\fR pipeline
listed in parentheses.
When reporting problems, \fIwatcher\fR refers to this command by the
alias provided in the braces; if no alias appears, \fIwatcher\fR uses
the entire pipeline.
.sp
The output format instructs \fIwatcher\fR how to parse the output.
Column format, indicated in the output format by \fBnum-num\fR, tells
\fIwatcher\fR to pick fields out by character column, while relative
format, denoted by a single integer, tells it to split the line on
whitespace and pick fields out by word number.
Through the convention \fBname%type\fR, the output format also names
each field and indicates whether the field is numeric, string, or
keyword, specified by \fBd\fR, \fBs\fR, or \fBk\fR respectively.
Keyword fields are used to match up corresponding output lines between
runs.
Thus
.EX
41-42 spaceused%d
.NX
indicates that this field, named \fBspaceused\fR, contains numeric
information in columns 41-42, while
.EX
2 status%s
.NX
informs \fIwatcher\fR that the second word (group of non-whitespace
characters) on the line is a string field named \fBstatus\fR.
For the \fIdf\fR example given above,
.EX
Filesystem    kbytes    used   avail capacity iused   ifree  %iused  Mounted on
/dev/hp1f      52431   39763    7424    84%    6937    9447    42%  /develop
.NX
\fBdevice\fR would be \fI/dev/hp1f\fR, \fBspaceused\fR would be 84, and
\fBinodesused\fR would be 42.
Similarly, the output from the \fIruptime\fR example, which looks like
this
.EX
charon up 26+07:53, 17 users, load 3.12, 2.90, 2.66
.NX
would be broken at the following places:
.EX
charon | up | 26+07:53, | 17 | users, | load | 3.12, | 2.90, | 2.66
.NX
assigning "up" to \fBstatus\fR, and 3.12 to \fBloadav\fR.
.sp
The field name also appears in the change format, which designates the
allowable values for the field.
For a string field, the value is given as a single character string;
for a numeric field, it takes the form of a percentage or absolute
change, or of a minimum and maximum which delineate an acceptable range.
Thus
.EX
inodesused 15%; inodesused 0 49.
.NX
signifies that DOC should be notified if the field named
\fBinodesused\fR increases by more than 15% from the last run, or if it
is outside the range 0 to 49; similarly
.EX
status "up";
.NX
tells \fIwatcher\fR to notify DOC if the \fBstatus\fR field contains
anything other than the word "up".
.sp
As \fIwatcher\fR parses the output of a pipeline, it stores the
pertinent parts of the output in a history file (by default,
./\fIwatcher.history\fR).
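.sp
The output and change formats described above can be sketched in a few
lines of Python.
This is only an illustration of the idea, not \fIwatcher\fR's C
implementation: the function names, the tuple encoding of the two
formats, the stripping of a trailing comma or percent sign from numeric
fields, and the reading of a percentage rule as a relative change in
either direction are all assumptions made for the example.

```python
def extract_fields(line, output_format):
    """Pull named fields out of one line of command output.

    output_format is a list of (locator, name, type) tuples. A locator
    that is a (first, last) pair selects 1-based character columns
    (column format); a single integer selects the Nth whitespace
    separated word (relative format). Types follow the paper:
    d = numeric, s = string, k = keyword.
    """
    words = line.split()
    fields = {}
    for locator, name, ftype in output_format:
        if isinstance(locator, tuple):
            first, last = locator
            text = line[first - 1:last].strip()
        else:
            text = words[locator - 1]
        if ftype == "d":
            # Assumption: trailing "," or "%" is not part of the number.
            fields[name] = float(text.rstrip(",%"))
        else:
            fields[name] = text
    return fields


def check_fields(fields, previous, change_format):
    """Return a list of problem descriptions for one output line.

    previous holds the numeric fields saved by the last run, or None
    for a new command; in that case only checks that need no history
    (ranges and strings) are made.
    """
    problems = []
    for rule in change_format:
        kind, name = rule[0], rule[1]
        value = fields[name]
        if kind == "range":
            low, high = rule[2], rule[3]
            if not low <= value <= high:
                problems.append(
                    f"{name} = {value:.2f}; valid range {low:.2f} to {high:.2f}")
        elif kind == "percent":
            if previous is not None and previous[name] != 0:
                old = previous[name]
                change = abs(value - old) / abs(old) * 100
                if change > rule[2]:
                    problems.append(
                        f"{name} changed by more than {rule[2]}%; "
                        f"previous {old:.2f}, current {value:.2f}")
        elif kind == "string":
            if value != rule[2]:
                problems.append(f'{name} is "{value}", expected "{rule[2]}"')
    return problems


# The ruptime entry from the sample control file, in this encoding.
ruptime_format = [(2, "status", "s"), (1, "machine", "k"), (7, "loadav", "d")]
ruptime_rules = [("range", "loadav", 0, 10), ("string", "status", "up")]

line = "charon up 26+07:53, 17 users, load 3.12, 2.90, 2.66"
fields = extract_fields(line, ruptime_format)
print(fields)
print(check_fields(fields, None, ruptime_rules))
```

With the \fIruptime\fR line shown earlier, the load average is within
range and the status is "up", so the list of problems is empty.
.sp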
The next time \fIwatcher\fR runs, it reads the history file to provide
comparison values for the command.
If a command is new (i.e. it has no previously stored output in the
history file), \fIwatcher\fR checks only the fields which require no
previous data, such as min-max fields, while still storing \fIall\fR of
the relevant information to the history file.
Thus, the next time the new command is run, it will be an \fIold\fR
command, and meaningful between-run comparisons can be made.
.sp
When \fIwatcher\fR detects no problems with the system, DOC receives an
empty mail message with the subject "\fIhostname\fR had no problems at
\fIdate\fR"; this is to ensure that \fImail\fR is running correctly.
When it notices a problem which should be brought to DOC's attention,
it mails the system problem report in a concise format, explaining what
is wrong and why.
Thus, rather than the megabytes of shell script output that DOC used to
receive and have to read, he merely sees this when he reads his mail:
.EX
Mail version 5.2 6/21/85. Type ? for help.
"/usr/spool/mail/ingham": 5 messages 5 new
N 1 root@charon.unm Sat Apr 11 16:00  8/212 "charon had no problems at Sat"
N 2 root@ariel.unm  Sat Apr 11 16:00  8/208 "ariel had no problems at Sat "
N 3 root@geinah.unm Sat Apr 11 16:00 11/417 "System problem report for gei"
N 4 root@izar.unm   Sat Apr 11 16:00  8/204 "izar had no problems at Sat A"
N 5 root@deimos.unm Sat Apr 11 16:00  8/212 "deimos had no problems at Sat"
.NX
The letters indicating no problems can be deleted immediately, and DOC
can turn his attention to the letter indicating a system problem.
A sample problem report would look something like this:
.EX
df has a max/min value out of range:
/dev/hp0h     140488  111195   15244    91%   10145   28767    26%  /usr
where spaceused = 91.00; valid range 0.00 to 89.00.
Also it had inodesused change by more than 10%.
Previous value 20.00; current value 26.00.
.NX
Note that if a line has more than one indication of a problem, all
anomalies are included in the report.
This provides DOC with as much information as possible, allowing him to
determine the problem quickly and devise a rapid fix (hopefully before
users know something is amiss).
.sp
.SE 4. Results (how it's helped us)
\fIwatcher\fR's primary advantage lies in the reduction of DOC's work
load.
It has taken over the more menial aspects of monitoring a system, tasks
like reading and comparing numbers, giving DOC more time to concentrate
on bugs of a kind which \fIwatcher\fR isn't set up to monitor, such as
problems in the accounting system.
DOC is apprised of potential problems quickly, and in some cases can
repair them in less time than simply reading the shell script output
would have taken.
.sp
The ability to monitor changes between runs has also helped bring to
our attention some problems which were missed in the DOC reports.
For example, disk space used on \fI/u2\fR on one of our machines jumped
by more than 15%.
Since this jump did not force the total space used above 90%, at which
point DOC would have investigated the filesystem, it is unlikely that
DOC would have even noticed this sudden change.
The facility to watch for relative changes between runs enables DOC to
catch problems in their infancy, and to fix problems such as
filesystems filling up too rapidly before they inconvenience the users.
.sp
Since the system manager specifies not only the commands \fIwatcher\fR
will execute and the time lapse between successive runs, but also the
parameters which indicate system anomalies, \fIwatcher\fR is a very
flexible, general system monitor.
Its use at UNM has increased the productivity of the system manager,
which has led in turn to an increase in the reliability and
availability of the systems at UNMCC.
.SE 5. Availability (how to get one)
\fIwatcher\fR will be sent to the moderator of mod.sources after the
conference is over.
.SE 6. References (you might also find this interesting)
.in +0.5i
.ti -0.5i
[1] Monitoring Free Disk Space, Rik Farrow, Wizard's Grabbag,
\fIUnix World\fR, Vol. IV, no. 3, pp. 86-87.
.in -0.5i